Extracting Group Velocity Dispersion values using quantum-mimic Optical Coherence Tomography and Machine Learning

Quantum-mimic Optical Coherence Tomography (Qm-OCT) images are cluttered with artefacts - parasitic peaks which emerge as a by-product of the algorithm used in this method. However, the shape and behaviour of an artefact are uniquely related to the Group Velocity Dispersion (GVD) of the layer this artefact corresponds to, and consequently the GVD values can be inferred by carefully analysing the artefacts. Since for multi-layered objects the number of artefacts is too high to enable layer-specific analysis, we employ a solution based on Machine Learning. We train a neural network with Qm-OCT data as an input and dispersion profiles, i.e. the depth distribution of GVD within an A-scan, as an output. By accounting for noise during training, we process experimental data and estimate the GVD values of BK7 and sapphire, and provide a qualitative GVD value distribution in a grape and a cucumber. Compared to other GVD-retrieving methods, our solution does not require user input, automatically provides dispersion values for all the visualised layers, and is scalable. We analyse the factors affecting the accuracy of determining GVD, namely noise in the experimental data and the general physical limitations of detecting GVD-induced changes, and suggest possible solutions.

1 Neural network [1]
The optimised initial model architecture uses a batch normalisation [2] layer after each convolutional layer, an average pooling layer after each convolutional block, one fully connected layer with 14336 units, and dropout [3] with a rate of 0.1 followed by layer normalisation [4]. We trained this model with a batch size of 16, a learning rate of 0.0001, the Adam optimiser [5], Mean Absolute Error (MAE) as the loss function, and a sigmoid as the output activation function.

Residual blocks
It has been shown [6] that incorporating residual blocks in the VGG-16 architecture improves its performance. We tested several configurations of residual block integration within VGG-16. We found that, instead of applying residual convolutional layers, simply adding the pooling layer output to the convolutional output gives the lowest MAE loss.
The final model architecture is presented in SI Fig. 1. The purpose of the alpha layer (symbolised by a green dot with the α character) is to expand the shape of the average pooling output to match the shape of the convolutional outputs before addition.
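The shape matching before the addition can be illustrated with a minimal pure-Python sketch. How the alpha layer expands the pooled output is not specified in this section, so the sketch assumes simple channel tiling on 1-D feature maps. The names `avg_pool`, `alpha_expand`, and `residual_add` are hypothetical, and the convolutional output is a stand-in list rather than the result of a real convolution.

```python
from typing import List

FeatureMap = List[List[float]]  # [channel][position], a 1-D stand-in for a 2-D map


def avg_pool(fmap: FeatureMap, window: int = 2) -> FeatureMap:
    """Non-overlapping average pooling along the position axis."""
    pooled = []
    for channel in fmap:
        usable = len(channel) - len(channel) % window  # drop an incomplete tail
        pooled.append([
            sum(channel[i:i + window]) / window
            for i in range(0, usable, window)
        ])
    return pooled


def alpha_expand(fmap: FeatureMap, out_channels: int) -> FeatureMap:
    """ASSUMPTION: expand the channel axis by tiling to match the conv output."""
    if out_channels % len(fmap) != 0:
        raise ValueError("out_channels must be a multiple of the input channels")
    return [list(fmap[c % len(fmap)]) for c in range(out_channels)]


def residual_add(pooled: FeatureMap, conv_out: FeatureMap) -> FeatureMap:
    """Element-wise sum of the expanded pooling output and the conv output."""
    skip = alpha_expand(pooled, len(conv_out))
    if any(len(s) != len(c) for s, c in zip(skip, conv_out)):
        raise ValueError("spatial sizes differ")
    return [[s + c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(skip, conv_out)]


block_input = [[1.0, 3.0, 5.0, 7.0]]    # 1 channel, 4 positions
conv_output = [[0.5, 0.5], [1.0, 1.0]]  # stand-in: 2 channels, 2 positions
merged = residual_add(avg_pool(block_input), conv_output)
```

The only point of the sketch is that both operands must have identical shapes before they are summed; the actual expansion used in the model may differ.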

Output activation function
The network is trained on signals representing objects with GVD within the range [-5,000, 5,000] fs²/mm. We normalise the GVD values to be within the range [0,1], where 0 corresponds to -5,000 fs²/mm, 1 to 5,000 fs²/mm, and 0.5 to 0 fs²/mm.
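The linear mapping described above can be written as a short sketch. The function and constant names are hypothetical; only the range bounds come from the text.

```python
GVD_MIN = -5000.0  # fs^2/mm, lower bound of the training range
GVD_MAX = 5000.0   # fs^2/mm, upper bound of the training range


def normalise_gvd(gvd: float) -> float:
    """Map a GVD value in fs^2/mm linearly onto [0, 1]."""
    if not GVD_MIN <= gvd <= GVD_MAX:
        raise ValueError(f"GVD {gvd} fs^2/mm is outside the training range")
    return (gvd - GVD_MIN) / (GVD_MAX - GVD_MIN)


def denormalise_gvd(value: float) -> float:
    """Map a normalised value in [0, 1] back to GVD in fs^2/mm."""
    return value * (GVD_MAX - GVD_MIN) + GVD_MIN
```

For example, `normalise_gvd(0.0)` gives 0.5, and `denormalise_gvd` inverts the mapping so that network outputs can be reported in physical units.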
The retrieval of dispersion profiles from FFT stacks is a regression problem, which would normally call for a linear activation in the output layer followed by a normalisation layer to keep the model outputs within the range [0,1]. Instead, we decided to use a sigmoid activation function in the output layer because, surprisingly, it fits our problem very well.
The sigmoid function (SI Fig. 2) returns 0.5 for an input of 0, which matches our normalisation, in which the output 0.5 corresponds to a GVD of 0 fs²/mm. The sigmoid also converges to 1 for large positive inputs and to 0 for large negative ones. The same trend holds in our case: 1 corresponds to high positive GVD values and 0 corresponds to high negative GVD values.
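These properties can be checked numerically with a small sketch. The helper `output_to_gvd` and its argument name `pre_activation` are hypothetical; the sketch only assumes the standard logistic function and the [-5,000, 5,000] fs²/mm range stated above.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for large negative x
    return e / (1.0 + e)


def output_to_gvd(pre_activation: float) -> float:
    """Sigmoid output in [0, 1] rescaled to [-5000, 5000] fs^2/mm."""
    return (sigmoid(pre_activation) - 0.5) * 10000.0
```

A pre-activation of 0 maps to 0 fs²/mm, and the output saturates towards the ends of the GVD range for large positive or negative pre-activations, so the network can never predict a value outside the range it was trained on.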

Metrics
We use Mean Absolute Error (MAE) as the loss function, which measures the mean absolute difference between the true and predicted values during training. To determine how well our predictions fit the ground truths, we created three goodness-of-fit metrics, presented in Algorithms 1-3. The metric in Algorithm 1 calculates the percentage of all prediction points that lie within a specified distance, dist, from the ground truth, regardless of the GVD value. Algorithm 2 takes into account only ground truth values in the proximity of 0 fs²/mm GVD, more specifically within a distance of 0.01, corresponding to 50 fs²/mm, whereas Algorithm 3 calculates the goodness-of-fit for GVD values outside the -50 to 50 fs²/mm range. We created the latter two metrics to separately assess the performance of the models for small and large GVD values (see Subsection 4.2, SI Table 2).
During training and in our further analysis, we used the default values of the GoFA, GoFAZL, and GoFOZL parameters.
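Since Algorithms 1-3 are not reproduced in this section, the following is only a minimal sketch of how such metrics could be computed on normalised profiles. It assumes that GoFA, GoFAZL, and GoFOZL correspond to Algorithms 1, 2, and 3 respectively. The default `dist=0.05` is a placeholder rather than the authors' default, `zero_band=0.01` is the distance quoted in the text, and using `<=` for "within" is also an assumption.

```python
from typing import Callable, Optional, Sequence


def mae(truth: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute difference between ground truth and prediction."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def _percent_within(truth: Sequence[float], pred: Sequence[float], dist: float,
                    keep: Callable[[float], bool]) -> Optional[float]:
    """Percentage of selected points whose prediction lies within dist of the truth."""
    pairs = [(t, p) for t, p in zip(truth, pred) if keep(t)]
    if not pairs:
        return None  # no ground-truth point satisfies the selection rule
    hits = sum(1 for t, p in pairs if abs(t - p) <= dist)
    return 100.0 * hits / len(pairs)


def gofa(truth, pred, dist=0.05):
    """All points, regardless of the GVD value (dist is a placeholder default)."""
    return _percent_within(truth, pred, dist, lambda t: True)


def gofazl(truth, pred, dist=0.05, zero_band=0.01):
    """Only points whose ground truth is near zero GVD (normalised value 0.5)."""
    return _percent_within(truth, pred, dist, lambda t: abs(t - 0.5) <= zero_band)


def gofozl(truth, pred, dist=0.05, zero_band=0.01):
    """Only points whose ground truth is outside the zero-GVD band."""
    return _percent_within(truth, pred, dist, lambda t: abs(t - 0.5) > zero_band)


truth = [0.5, 0.5, 0.8, 0.2]      # toy normalised dispersion profile
pred = [0.51, 0.52, 0.82, 0.5]    # toy prediction
scores = (gofa(truth, pred), gofazl(truth, pred), gofozl(truth, pred))
```

In this toy case the near-zero points are all predicted within `dist` while one of the far-from-zero points is not, which is the kind of difference the two restricted metrics are meant to expose.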

SNR analysis
We trained our model with several datasets, each representing a different noise level. SI Fig. 3 presents the performance of the model after 100 epochs of training with datasets comprising signals with an SNR of 25 dB (SI Fig. 3a,e), 30 dB (SI Fig. 3b,f), 35 dB (SI Fig. 3c,g), and 45 dB (SI Fig. 3d,h). Although some overfitting seems to be present in each case, it is predominantly a problem for the 45 dB dataset, where the model becomes very unstable from epoch 67 onwards. For the 35 dB dataset, the model also shows some minor instabilities, but it keeps improving over the course of the training. The 25 dB model shows no instabilities. For the 30 dB model, we observe a decrease in fluctuations and an overall improvement in performance over time, with no overfitting at any epoch. Overall, the higher the SNR of the training dataset, the less stable the training is. On the other hand, the higher the SNR, the better the predictions fit the true values, as seen from the GoFA values in SI Fig. 3e-h.
SI Figure 4. A random test input FFT stack with different levels of added noise: a) without noise, b) SNR = 25 dB, c) SNR = 30 dB, d) SNR = 35 dB, and e) SNR = 45 dB. The inputs were trimmed to the first 512 points.
This behaviour is linked to how the noise changes the OCT signals and, consequently, the inputs of the network. SI Fig. 4 shows a random test FFT stack with different levels of noise: 25 dB (SI Fig. 4b), 30 dB (SI Fig. 4c), 35 dB (SI Fig. 4d), and 45 dB (SI Fig. 4e). The same FFT stack for the noise-free signal is shown in SI Fig. 4a. In SI Fig. 4b, we see that a high level of noise visibly deteriorates the FFT stack: additional elements appear and other information is degraded or removed. The noise level corresponding to an SNR of 30 dB also has a detrimental effect on the FFT stack (SI Fig. 4c), but the changes in the structural information are barely visible. None of these effects is visible in SI Fig. 4e, which closely resembles the noise-free FFT stack (SI Fig. 4a).
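To make the SNR levels concrete, the sketch below adds noise to a signal at a prescribed SNR. It assumes additive zero-mean white Gaussian noise and defines SNR as the ratio of mean signal power to mean noise power; the noise model and the stage at which noise enters the actual training data are not specified here. The function names and the toy cosine signal are hypothetical.

```python
import math
import random
from typing import List, Sequence


def add_noise(signal: Sequence[float], snr_db: float, rng: random.Random) -> List[float]:
    """Add white Gaussian noise so that P_signal / P_noise = 10 ** (snr_db / 10)."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)  # noise standard deviation
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean: Sequence[float], noisy: Sequence[float]) -> float:
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)  # fixed seed for reproducibility
clean = [math.cos(2 * math.pi * 50 * k / 4096) for k in range(4096)]  # toy fringe signal
noisy = {snr: add_noise(clean, snr, rng) for snr in (25, 30, 35, 45)}
```

Under this definition, going from 45 dB to 25 dB raises the noise standard deviation by a factor of 10, which is consistent with the visibly stronger degradation described for the 25 dB stack.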
To illustrate how SNR changes the predictions, we randomly selected several samples from the test dataset, each corresponding to a different number of interfaces, and calculated the predictions with the model for each SNR level. In each case, the training lasted for 100 epochs. SI Fig. 5 shows the comparison between the predictions (orange lines) and the ground truth (blue lines). Column 1 of SI Fig. 5 presents the data for training with the 25 dB dataset, column 2 with 30 dB, column 3 with 35 dB, and column 4 with 45 dB. SI Fig. 5a shows a 2-interface object, b) a 4-interface object, c) an 8-interface object, and d) a 12-interface object. SI Fig. 5 confirms the previous statement that a higher SNR allows better predictions. Although the 45 dB model was generally unstable, it provided very good results at the 100th epoch (SI Fig. 5, column 4), being the closest to the ground truths. Finally, we see that the predictions of the 25 dB model (SI Fig. 5, column 1) are visibly further from the truth than those of the other models.
Note that although a higher SNR produces higher goodness-of-fit values, the model needs to perform well at a specific noise level, namely the one exhibited by the experimental data.

Metrics
Our further analysis was performed only on the model that provided the best results for our experimental data, the one trained with signals with an SNR of 30 dB. The analysis is based on a test dataset comprising 20,000 object samples that we [...] SI Fig. 6 shows the performance of the model after 200 epochs of training. We see that the initial major fluctuations of loss and goodness-of-fit stop after 75 epochs and the model stabilises (SI Fig. 6a,c). In SI Fig. 6b,d, [...]

Number of interfaces and GVD variability
Our analysis shows that, in addition to SNR, the number of interfaces is another parameter that affects loss and goodness-of-fit. SI Table 2 presents the performance of the model for different numbers of interfaces.
Table 2. Performance of the 30 dB SNR model for a test dataset. MAE (×10⁻³), GoFA, GoFAZL and GoFOZL values are shown separately for each number of interfaces. The "All" column shows the mean value of the metrics for all interfaces. Values in brackets are the standard deviations of the results. The best results are marked in bold.
We see a stark distinction between the results for objects with a high and a low number of interfaces. Objects with numerous interfaces achieve a several times worse (higher) MAE and a significantly worse (lower) GoFA, with a difference of up to 37 percentage points between the GoFA for 2-interface objects (the highest) and for 12-interface objects. According to the GoFAZL metric in SI Table 2, the model approximates GVD close to 0 fs²/mm quite well for every number of interfaces, with over 89% goodness-of-fit. The goodness-of-fit for the remaining values (SI Table 2, GoFOZL metric) is much lower, ranging from 49.9% for 12-interface objects to just 86.9% for the simplest objects with 2 interfaces. Thus, values that are further from 0 fs²/mm GVD have the greatest effect on the overall goodness-of-fit.
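A per-interface summary of the kind reported in SI Table 2 (a mean with the standard deviation in brackets) could be produced along the following lines. This is a sketch only: the function name is hypothetical, the population standard deviation is an assumption, and the numbers are made up for illustration and are not values from SI Table 2.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, Iterable, Tuple


def summarise_by_interfaces(
    records: Iterable[Tuple[int, float]]
) -> Dict[int, Tuple[float, float]]:
    """Group (n_interfaces, metric_value) pairs and return {n: (mean, std)}."""
    groups = defaultdict(list)
    for n_interfaces, value in records:
        groups[n_interfaces].append(value)
    # ASSUMPTION: population standard deviation; the text does not say which is used
    return {n: (mean(v), pstdev(v)) for n, v in sorted(groups.items())}


# Made-up GoFA percentages, for illustration only
toy_records = [(2, 90.0), (2, 80.0), (12, 50.0), (12, 40.0)]
summary = summarise_by_interfaces(toy_records)
```

The same helper can be applied to MAE, GoFAZL, or GoFOZL values by passing the corresponding metric for each test sample.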

We visualised GoFA values for objects with different numbers of interfaces in SI Fig. 7. Among all test data samples, we did not observe a single sample with two interfaces that achieved 10-20% GoFA (SI Fig. 7d1). Because we still see the simplest objects, with just two interfaces (SI Fig. 7a), achieving low GoFA, we deduced that there must be another factor besides SNR and the number of interfaces that affects the results.
We observed that the biggest deviations of the predictions from the ground truth happen at the locations of the interfaces, and that the smaller the difference in GVD between the layers, the worse the prediction of GVD gets. We checked how the spread of GVD within objects affects GoFA and present the results in SI Fig. 8 for epoch 200.
SI Fig. 8b-l confirm our previous conclusion that the number of interfaces has a major effect on GoFA. Predictions for simpler structures containing 2-4 interfaces (SI Fig. 8b-d) achieve higher mean GoFA than those for the most complex objects with 10-12 interfaces (SI Fig. 8j-l), which have the lowest GoFA values. On the same charts, we also see that the standard deviation of the GVD values within the object is another major factor that influences the results. In SI Fig. 8b, most examples with a high standard deviation, over 375 fs²/mm, have the highest goodness-of-fit. A similar observation can be made regardless of the number of interfaces in the object: we obtained better GoFA results for objects whose layers' GVD values differ more. This observation is also confirmed by the analysis presented in Fig. 5 in the main article.
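The relationship between GVD spread and GoFA can be explored with a sketch like the one below. It assumes the spread is the population standard deviation of the per-layer GVD values; whether the analysis uses layers or all points of the dispersion profile is not stated here. The function names are hypothetical, the 375 fs²/mm threshold is the one quoted for SI Fig. 8b, and the sample values are made up for illustration.

```python
from statistics import mean, pstdev
from typing import Iterable, Optional, Sequence, Tuple


def gvd_spread(layer_gvds: Sequence[float]) -> float:
    """Standard deviation (fs^2/mm) of the GVD values within one object."""
    return pstdev(layer_gvds)


def mean_gofa_by_spread(
    samples: Iterable[Tuple[Sequence[float], float]],
    threshold: float = 375.0,
) -> Tuple[Optional[float], Optional[float]]:
    """Mean GoFA of objects with spread <= threshold and > threshold."""
    low, high = [], []
    for layer_gvds, gofa_value in samples:
        target = high if gvd_spread(layer_gvds) > threshold else low
        target.append(gofa_value)
    return (mean(low) if low else None, mean(high) if high else None)


# Made-up (layer GVDs in fs^2/mm, GoFA in %) pairs, for illustration only
toy_samples = [
    ([100.0, 150.0], 60.0),      # spread 25 fs^2/mm
    ([0.0, 500.0], 70.0),        # spread 250 fs^2/mm
    ([-2000.0, 3000.0], 95.0),   # spread 2500 fs^2/mm
]
low_mean, high_mean = mean_gofa_by_spread(toy_samples)
```

Applied to real test results, comparing the two means for each number of interfaces would show whether objects with a larger GVD spread are indeed predicted more accurately.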